Abstract



A set of linear conditions on the item response functions is derived that guarantees identical 
observed-score distributions on two test forms. The conditions can be added as constraints to 
a linear programming model for test assembly that assembles a new test form to have an 
observed-score distribution optimally equated to the distribution of an old form. For a
well-designed item pool, use of the model results in observed-score pre-equating and removes the
need for post hoc equating by a conventional observed-score equating method. An
empirical example illustrates the use of the model for an item pool from the Law School 
Admission Test. 

Observed-Score Equating as a Test Assembly Problem

A well-known method of observed-score equating is equipercentile equating. The 
method assumes that estimates of the observed-score distributions on the old and new test 
form are available and equates the observed scores on the new form to the scores on the old 
form estimating the same percentile in the population of examinees. With the advent of item 
response theory (IRT) (Lord, 1980; Rasch, 1960; Hambleton & Swaminathan, 1985; van der 
Linden & Hambleton, 1997), new methods of equating have become available. These methods 
assume that the items in the two test forms have been calibrated on the same scale for the 
ability parameter in the IRT model. In one method, the response functions are used to generate 
the observed-score distributions on both test forms, and the equipercentile method is 
employed to find the transformation that equates the two distributions. Another method uses 
the test characteristic functions of the two tests as a system of parametric equations that 
equates the true scores on the two tests. If the two tests have high reliability, true-score 
equating is often used as an approximation to observed-score equating. An introduction to 
equipercentile and IRT equating is given in Kolen and Brennan (1995); IRT equating is also 
discussed in Lord (1980). 

Under IRT, tests from the same pool are automatically scored on the same scale for the 
ability parameter. From a theoretical point of view, it seems therefore superfluous to equate 
the observed-score scale as well. Nevertheless, practical reasons for this additional equating
exist. Many testing programs had already fixed their score scales before IRT was introduced
and replacing them by ability estimates with a more complicated relation to the response 
vectors than number-right scores might have been difficult to explain to their examinees. In 
addition, since the ability scale has a nonlinear relation to the observed-score scale, the sudden 
change of score distributions could have confused these examinees too. It is therefore not 
uncommon to find testing programs using IRT for such routines as item parameter estimation, 
screening of item quality, test assembly, and test equating but reporting their scores still on a 
traditional scale. 

In these programs, tests are usually assembled from a large pool of items regularly 
replenished by new items calibrated on the ability scale of the pool. It is the purpose of this 
paper to show that, provided their item pools are well designed, such programs can omit test 
equating if the new test form is assembled to have an observed-score distribution identical to 
the one on the standard test to which new forms have been equated so far. As shown in this 
paper, the result can be obtained by imposing a simple set of linear constraints on the response
functions of the new test form. Use of these constraints to pre-equate observed-score 
distributions has several practical advantages: 

1. The results hold for any population of students for which the calibration of the
item pool is valid, no matter its ability distribution;

2. No resources are lost running separate equating studies; 

3. Scores on the new test form can be reported immediately after the test 
administration; 

4. Unlike current equating practice, the scale of the observed scores on the new 
test form is not distorted by a (nonlinear) score transformation, and the scores 
thus keep their interpretation as number-right scores;

5. The scores on the two forms are equitable in that the procedure ensures identity 
of the conditional distributions of observed scores on the two forms for each 
possible ability level, a condition generally not met when using one of the
existing equating methods (Lord, 1980, sect. 13.2). 

In the remainder of the paper, first the theory of equipercentile and IRT equating is 
briefly reviewed. Then, a set of simple conditions on the item response functions is derived that
guarantees identical observed-score distributions on two test forms. The set replaces an
earlier approximate condition given in van der Linden and Luecht (1996). The conditions are 
linear in the items and can be included in a linear-programming (LP) model for test assembly 
that optimizes the composition of the new test subject to the other test specifications already 
in use. Next, it will be indicated how the method can be generalized to deal with item pools 
dependent on more than one ability variable as well as other scoring systems than number 
correct. Use of the test assembly model is empirically illustrated for an item pool from the 
Law School Admission Test (LSAT). 

Equating Transformations 

The following notation is needed to present the equating transformations. Index
$j=1,\dots,n$ will be used to denote the items in the old test form, whereas $i$ is used to denote the
items in the new test form ($i=1,\dots,n$) or in the pool from which the form is assembled
($i=1,\dots,I$). Responses by examinee $a$ to item $i$ or $j$ will be represented by random variables
$U_{ai}=u_{ai}$ and $U_{aj}=u_{aj}$, respectively. Number-correct scores for examinee $a$ on the two tests are
defined as $X_a=\sum_{i=1}^{n}U_{ai}$ and $Y_a=\sum_{j=1}^{n}U_{aj}$, with true scores $\tau_{X_a}=E(X_a)$ and $\tau_{Y_a}=E(Y_a)$,
respectively. Finally, it is assumed that $X$ and $Y$ have cumulative distribution functions $F(x)$ and
$G(y)$.

In equipercentile equating, both test forms are administered to a single sample or two
independent random samples from the population of examinees to estimate the
transformation, $e(x)$, that maps score $X$ on the scale of score $Y$:

$$e(x) = G^{-1}(F(x)). \tag{1}$$

The first step in this transformation identifies x as a percentile under the distribution of X; the 
second step equates x to the same percentile under the distribution of Y. 
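
The two steps of (1) can be sketched for discrete score distributions as follows. This is only an illustration under stated assumptions: the inverse $G^{-1}$ is taken here as the smallest score $y$ with $G(y) \geq F(x)$, which is one of several possible conventions for discrete scores, and the function name `equipercentile` and the example probabilities are hypothetical.

```python
from itertools import accumulate

def equipercentile(f, g):
    """Map each score x on form X to the smallest y with G(y) >= F(x).

    f and g are lists of probabilities for the scores 0, 1, ..., n.
    The discrete inverse used here is an assumption of this sketch.
    """
    F = list(accumulate(f))  # cumulative distribution of X
    G = list(accumulate(g))  # cumulative distribution of Y
    mapping = []
    for Fx in F:
        # The small tolerance guards against floating-point round-off.
        y = next((y for y, Gy in enumerate(G) if Gy >= Fx - 1e-12), len(G) - 1)
        mapping.append(y)
    return mapping

# Hypothetical score distributions on a three-item test (scores 0..3).
f = [0.1, 0.2, 0.4, 0.3]
g = [0.3, 0.4, 0.2, 0.1]
print(equipercentile(f, g))
```

When the two distributions are identical, the mapping reduces to the identity, which is the situation the test assembly approach in this paper aims for.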

To discuss the mathematical equations involved in IRT equating, it is assumed that the 
responses to the items in the two test forms fit the 3-parameter logistic (3-PL) model. The 
model gives the probability of a correct response $U_{ai}=1$ as

$$P_i(\theta_a) = c_i + (1-c_i)\left[1+\exp(-a_i(\theta_a-b_i))\right]^{-1}, \tag{2}$$

where $\theta_a\in(-\infty,\infty)$ is a parameter for the ability of examinee $a$, $b_i\in(-\infty,\infty)$ and $a_i\in[0,\infty)$ are
parameters for the difficulty and discriminating power of item $i$, respectively, and $c_i\in[0,1]$ is
the guessing parameter of the item. The 3-PL model is chosen because it was used to calibrate
the LSAT item pool in the empirical example at the end of this paper. If $h(\theta)$ is the density of
the ability distribution in the population of examinees, the probability functions of $X$ and $Y$ are
given by

$$f(x) = \int p_X(x\mid\theta)\,h(\theta)\,d\theta \tag{3}$$

and

$$g(y) = \int p_Y(y\mid\theta)\,h(\theta)\,d\theta, \tag{4}$$

where $p_X(x\mid\theta)$ and $p_Y(y\mid\theta)$ are the probability functions of the conditional distributions of $X$
and $Y$ given $\theta$, respectively. These conditional distributions are generalized binomial (Lord,
1980, sect. 4.1; Kendall & Stuart, 1977, sect. 5.10).
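
To make the conditional distribution concrete, the following minimal sketch evaluates the 3-PL response function in (2) and builds the generalized binomial distribution of the number-correct score at a fixed ability value by adding one item at a time. The function names and the item parameters are hypothetical and serve only as an illustration.

```python
import math

def p3pl(theta, a, b, c):
    """3-PL response probability as in equation (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def conditional_dist(probs):
    """Distribution of the number-correct score for independent items
    with success probabilities `probs` (generalized binomial)."""
    dist = [1.0]  # with zero items, the score is 0 with probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for x, px in enumerate(dist):
            new[x] += px * (1.0 - p)   # item answered incorrectly
            new[x + 1] += px * p       # item answered correctly
        dist = new
    return dist

# Hypothetical (a, b, c) parameters for a three-item test.
items = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.25), (1.2, 0.7, 0.15)]
probs = [p3pl(0.0, a, b, c) for a, b, c in items]
print(conditional_dist(probs))  # probabilities of X = 0, 1, 2, 3 at theta = 0
```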

In IRT observed-score equating, the probability functions f(x) and g(y) are estimated 
from a random sample of examinees, estimates of the cumulative distribution functions F(x) 
and G(y) are calculated, and (1) is used to estimate the transformation from X to Y. Two
alternative methods to implement IRT observed-score equating are discussed in Zeng and 
Kolen (1995). 

In IRT true-score equating, the fact is used that the true scores of X and Y are equal to: 

$$\tau_X = \tau_X(\theta) = \sum_{i=1}^{n} P_i(\theta) \tag{5}$$

$$\tau_Y = \tau_Y(\theta) = \sum_{j=1}^{n} P_j(\theta). \tag{6}$$



These two equations, known as the test characteristic functions of form X and Y, define the 
(parametric) relation between Tx and Ty that can be used to equate the true score of X to the 
one of Y. The fact that the equations in (5)-(6) are simpler to apply than the procedure based on 
(3)-(4) motivates the use of true-score equating as an approximation to observed-score equating 
for large tests. 
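
The parametric relation defined by (5)-(6) can be sketched numerically: for a given true score on form X, solve (5) for the ability value and substitute it into (6). The bisection search, the search interval, and the item parameters below are assumptions of this sketch. It also assumes that the true score lies strictly between the sum of the guessing parameters and the test length, since other values are not attained by (5).

```python
import math

def p3pl(theta, a, b, c):
    """3-PL response probability as in equation (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcf(theta, items):
    """Test characteristic function: sum of the response probabilities."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def true_score_equate(tau_x, items_x, items_y, lo=-10.0, hi=10.0):
    """Solve tcf_X(theta) = tau_x by bisection and return tcf_Y(theta).

    Relies on the test characteristic function being increasing in theta,
    which holds when all discrimination parameters are positive.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tcf(mid, items_x) < tau_x:
            lo = mid
        else:
            hi = mid
    return tcf(0.5 * (lo + hi), items_y)

# Hypothetical forms: Y has the same items as X but shifted to be easier.
form_x = [(1.0, 0.0, 0.2), (1.5, 0.5, 0.1)]
form_y = [(1.0, -1.0, 0.2), (1.5, -0.5, 0.1)]
print(true_score_equate(1.2, form_x, form_y))
```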



Conditions for Observed-Score Distributions to be Identical 

From the probability functions in (3) and (4) it is clear that, since the ability distribution
is common, the observed-score distributions of $X$ and $Y$ are identical if the conditional
distributions of $X$ and $Y$ given $\theta$ are. This fact is used in the proof of the following proposition.

Proposition 1. For any $h(\theta)$, the distributions of observed scores $X$ and $Y$ are identical if

$$\sum_{i=1}^{n} P_i^r(\theta) = \sum_{j=1}^{n} P_j^r(\theta), \quad \text{for } r=1,\dots,n. \tag{7}$$

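Proof. The distribution of $X$ given $\theta$ has no probability function in closed form, but its
probabilities can be obtained via the generating function $\prod_{i=1}^{n}(Q_i + P_i t)$, with $Q_i = 1 - P_i$. In addition, this
probability generating function is known to have the following expansion in the powers of
$P_1-\xi, P_2-\xi, \dots, P_n-\xi$:

$$\begin{aligned}
\text{Prob}\{X=x\} ={}& P_n(x) + \frac{n}{2}V_2C_2(x) + \frac{n}{3}V_3C_3(x) \\
&+ \left(\frac{n}{4}V_4 - \frac{n^2}{8}V_2^2\right)C_4(x) + \left(\frac{n}{5}V_5 - \frac{n^2}{6}V_2V_3\right)C_5(x) + \dots, \quad x=0,1,\dots,n,
\end{aligned} \tag{8}$$

with

$$P_n(x) = \binom{n}{x}\xi^{x}(1-\xi)^{n-x}, \tag{9}$$

$$C_r(x) = \sum_{v=0}^{r}(-1)^{v+1}\binom{r}{v}P_{n-r}(x-v), \quad r=2,\dots,n, \tag{10}$$

$$V_r = n^{-1}\sum_{i=1}^{n}(P_i-\xi)^r, \tag{11}$$

where $\xi$ is defined as $n^{-1}\sum_{i=1}^{n}P_i$ (Walsh, 1953, 1963; Lord & Novick, 1968, sect. 23.10).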
Because (8) is an exact identity, the distributions of $X$ and $Y$ given $\theta$ are equated if the
expressions in (9)-(11) are equal. For (9)-(10) this condition is realized if $\xi$ is equal for both
tests, that is, if (7) is true for $r=1$. In addition, (11) can be written as

$$V_r = n^{-1}\sum_{v=0}^{r}\binom{r}{v}(-\xi)^{r-v}\sum_{i=1}^{n}P_i^{v}. \tag{12}$$

Substitution of $r=2,\dots,n$ into (12) shows that these expressions are equal for both tests if (7)
holds for $r=1,\dots,n$. These two conclusions establish the proposition. ■
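
As a small worked illustration of Proposition 1 (not part of the original derivation), consider a test of $n=2$ items at a fixed ability value. The symbols $s_1$ and $s_2$ are introduced here only for this example and denote the two power sums appearing in (7). Expanding the generating function for two items shows that each score probability depends on the response probabilities only through $s_1$ and $s_2$, so two tests that agree on both sums have the same conditional score distribution.

```latex
\begin{aligned}
s_1 &= P_1 + P_2, \qquad s_2 = P_1^2 + P_2^2, \\
\text{Prob}\{X=2\} &= P_1P_2 = \tfrac{1}{2}\left(s_1^2 - s_2\right), \\
\text{Prob}\{X=1\} &= P_1(1-P_2) + (1-P_1)P_2 = s_1 - s_1^2 + s_2, \\
\text{Prob}\{X=0\} &= (1-P_1)(1-P_2) = 1 - s_1 + \tfrac{1}{2}\left(s_1^2 - s_2\right).
\end{aligned}
```

The three probabilities sum to one, as they should.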

The following two propositions establish useful relations between (7) and (8). 

Proposition 2. For each integer value of $R<n$, the set of equalities in (7) obtained for
$r=1,\dots,R$ equates the first $R$ terms of the series in (8).

The truth of this proposition follows immediately from the substitution at the end of the proof
of Proposition 1. ■

Proposition 3. For $n\to\infty$, the contributions of the terms in (8) of order $r\geq 2$ vanish.

Proof. For $r=1$, the condition in (7) formulates that the test characteristic functions in
(5)-(6) be equal. Thus, equal test characteristic functions imply that the two distributions have
identical first terms for (8), and true-score equating can be viewed as a first-order
approximation to observed-score equating. Since the observed-score distribution converges to
the true-score distribution if test length increases, this first-order approximation improves with
test length. As a result, for longer tests the contribution of the higher-order terms in (8)
decreases. ■

The practical implication of the proposition is that the series in (8) can be truncated after a few
terms. The same can be done for the test assembly model below by imposing (7) on the selection
of the items only for small values of $r$. In the empirical example later in this paper only the
first three terms were used.
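
The effect of truncating the series can be explored with a short numerical sketch. The code below evaluates the first one, two, or three terms of (8), using the binomial term and the quantities $C_r(x)$ and $V_r$ with coefficients $n/2$ and $n/3$, and compares the result with the exact distribution obtained by item-by-item recursion. The function names and the response probabilities are hypothetical; only the first three terms are implemented, because the higher-order terms have more involved coefficients.

```python
from math import comb

def binom_pmf(m, p, x):
    """Binomial probability P_m(x); zero outside 0..m (also when m < 0)."""
    if x < 0 or x > m:
        return 0.0
    return comb(m, x) * p**x * (1.0 - p)**(m - x)

def exact_dist(probs):
    """Exact generalized binomial distribution by item-by-item recursion."""
    dist = [1.0]
    for p in probs:
        dist = [(dist[x] * (1.0 - p) if x < len(dist) else 0.0)
                + (dist[x - 1] * p if x > 0 else 0.0)
                for x in range(len(dist) + 1)]
    return dist

def truncated_series(probs, n_terms):
    """Series (8) cut off after its first n_terms terms (1, 2, or 3)."""
    if not 1 <= n_terms <= 3:
        raise ValueError("this sketch implements the first three terms only")
    n = len(probs)
    xi = sum(probs) / n                                   # mean probability
    approx = [binom_pmf(n, xi, x) for x in range(n + 1)]  # binomial first term
    for r in range(2, n_terms + 1):
        v_r = sum((p - xi)**r for p in probs) / n         # V_r
        for x in range(n + 1):
            c_r = sum((-1)**(v + 1) * comb(r, v) * binom_pmf(n - r, xi, x - v)
                      for v in range(r + 1))              # C_r(x)
            approx[x] += (n / r) * v_r * c_r
    return approx

probs = [0.2, 0.6, 0.9]  # hypothetical response probabilities at one theta
exact = exact_dist(probs)
for k in (1, 2, 3):
    err = max(abs(a - e) for a, e in zip(truncated_series(probs, k), exact))
    print(k, err)
```

For a three-item test the generating function is a cubic, so three terms already reproduce the exact distribution up to rounding. For longer tests the truncated series is an approximation.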

An interesting consequence, however, is obtained if all n equalities in (7) are imposed. 
The set of conditions is then equivalent to the one of the response functions of the two tests 
being pairwise identical. This property as well as its proof were suggested by N. D. Verhelst 
(personal communication, November 1, 1996): 

Proposition 4 . The conditions in (7) hold simultaneously for all values of r if and only 
if the two tests have pairwise identical response functions. 

Proof. If the two tests have pairwise identical response functions, then the conditions
in (7) hold trivially. The proof of the reverse implication is based on the idea to define
probability spaces over the two sets of response functions and to invoke the principle of
moments (Kendall & Stuart, 1977, sect. 4.22). Thus, let $\langle \mathcal{X}_\theta, \mathcal{P}(\mathcal{X}_\theta), p\rangle$ be a (finite)
probability space, where $\mathcal{X}_\theta$ is the set of response probabilities in test $X$ for a fixed value of $\theta$,
$\mathcal{P}(\mathcal{X}_\theta)$ is the power set of $\mathcal{X}_\theta$, and $p$ is the (uniform) probability function $p(i)=1/n$. In addition,
a random variable $X_\theta(i)=P_i(\theta)$ is defined. An analogous probability space and random variable
are defined for test $Y$. The conditions in (7) stipulate that the first $n$ moments of the distributions
induced by the two random variables are identical. Therefore, the two distributions are identical;
that is, for each value of $X_\theta(i)=P_i(\theta)$ there exists a value $Y_\theta(j)=P_j(\theta)$, and vice versa. Because
the argument holds for an arbitrary value of $\theta$, the pair of functions $P_i(\theta)$ and $P_j(\theta)$ have more
than two points in common and are identical. ■

Note that two tests with pairwise identical response functions have equal true scores 
and observed-score variances for each examinee in the population for which the IRT model 
holds. These tests therefore yield parallel measurements (Lord & Novick, 1968, definition 
2.13.1). Proposition 4 thus shows the (stringent) conditions in IRT under which the classical
definition of parallel measurements holds.

Proposition 4 also implies that a test assembly model with a larger number of the
equalities in (7) imposed on the new test might be a convenient one-stage alternative to the
two-stage approaches to item matching proposed by van der Linden and Timminga (1988;
see also Armstrong & Jones, 1992). However, as the focus of this paper is on the less
demanding problem of observed-score equating, the latter suggestion is not further elaborated
here.



Test Assembly Model

Because the conditions in (7) are linear in the items, they can be used as objective
function and/or constraints in an LP model for optimal test assembly. Such models have been
proposed earlier, for example, to assemble tests to match a target information function
(Swanson & Stocking, 1993; Theunissen, 1985; van der Linden & Boekkooi-Timminga,
1989), to assemble sets of parallel test forms (Adema, 1992; Armstrong, Jones & Wu, 1992;
Boekkooi-Timminga, 1987, 1990), to maximize classical test reliability (Adema & van der
Linden, 1989; Armstrong, Jones & Wang, 1994), to match tests item by item (Armstrong &
Jones, 1992; van der Linden & Boekkooi-Timminga, 1988), or to implement constrained
adaptive testing (van der Linden & Reese, in press). In addition, these models allow for all
other test specifications typically constraining the selection of items in a testing program.

Following is the model proposed to select a new test form from a pool of items with an
observed-score distribution optimally equated to the distribution of an old form. Let $x_i$,
$i=1,\dots,I$, be the decision variables denoting whether ($x_i=1$) or not ($x_i=0$) item $i$ is included in
the new test form. Because the first terms in (8) are most important, the idea is to choose the
values of $x_i$ such that the differences between the left-hand and right-hand sums in (7) are
minimized for $r=1,\dots,R$ at $\theta_k$, $k=1,\dots,K$. As already noted, $R$ can be small. Also, since the item
response functions in (7) are well-behaved continuous functions, only a few points are necessary
to control their shapes over the range of $\theta$ values considered. However, there are no limitations
as to the number of values and their spacing, and test assemblers are free to select the set best
fitting their needs.

The model is as follows:

$$\text{minimize } y \tag{13}$$

subject to

$$\sum_{i=1}^{I} P_i^r(\theta_k)x_i - \sum_{j=1}^{n} P_j^r(\theta_k) \leq y, \quad k=1,\dots,K;\ r=1,\dots,R, \tag{14}$$

$$\sum_{i=1}^{I} P_i^r(\theta_k)x_i - \sum_{j=1}^{n} P_j^r(\theta_k) \geq -y, \quad k=1,\dots,K;\ r=1,\dots,R, \tag{15}$$

$$\sum_{i=1}^{I} x_i = n, \tag{16}$$

$$\sum_{i=1}^{I} q_{is}x_i \leq r_s, \quad s=1,\dots,S, \tag{17}$$

$$\sum_{i\in V_s} x_i = n_s, \quad s=1,\dots,S, \tag{18}$$

$$x_i \in \{0,1\}, \quad i=1,\dots,I, \tag{19}$$

$$y \geq 0. \tag{20}$$

The constraints in (14)-(15) require the difference between $\sum_{i=1}^{I}P_i^r(\theta_k)x_i$ and $\sum_{j=1}^{n}P_j^r(\theta_k)$ to be
in the interval $[-y,y]$, $y\geq 0$, for $k=1,\dots,K$ and $r=1,\dots,R$, whereas the objective function in (13)
minimizes $y$. The model therefore effectively minimizes the largest difference between these
sums over the $\theta$ values selected and is thus of the minimax type. The constraint in (16) sets the
length of the new test form equal to $n$. The constraints in (17)-(18) deal with all possible
additional test specifications. For example, if the total length of the test measured by its number
of lines should be smaller than a given number $r_s$, $q_{is}$ can be defined as the number of lines in
item $i$, and the constraint in (17) guarantees the result. Likewise, if some of the items in the
pool measure the application of certain skills, the set $V_s$ in (18) can be chosen to be this subset
of items and the constraint guarantees the selection of $n_s$ items from it. Various other types of
constraints are possible; for a review see van der Linden (in press) or van der Linden and
Boekkooi-Timminga (1989). Finally, the constraints in (19)-(20) define the range of the
decision variables.
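
The minimax objective can be illustrated on a toy pool. The sketch below does not use an LP or integer-programming solver; instead it swaps in plain exhaustive enumeration of all subsets of the required length, which is feasible only for very small pools. It covers the objective and the length constraint and omits the additional specifications in (17)-(18). All names and item parameters are hypothetical.

```python
import math
from itertools import combinations

def p3pl(theta, a, b, c):
    """3-PL response probability as in equation (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def power_sums(items, thetas, R):
    """Sums of the r-th powers of the response functions, r = 1..R,
    evaluated at each ability value in `thetas`."""
    return [sum(p3pl(t, a, b, c)**r for a, b, c in items)
            for t in thetas for r in range(1, R + 1)]

def assemble(pool, old_form, thetas, R):
    """Pick len(old_form) items from the pool that minimize the largest
    absolute difference y between new-form and old-form power sums."""
    n = len(old_form)
    target = power_sums(old_form, thetas, R)
    best_y, best_set = math.inf, None
    for subset in combinations(range(len(pool)), n):
        sums = power_sums([pool[i] for i in subset], thetas, R)
        y = max(abs(s - t) for s, t in zip(sums, target))
        if y < best_y:
            best_y, best_set = y, subset
    return best_set, best_y

# Hypothetical pool of (a, b, c) parameters and a two-item old form
# that is not itself in the pool.
pool = [(1.0, -1.0, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2),
        (1.0, 1.0, 0.2), (0.9, -0.3, 0.2), (1.1, 0.3, 0.2)]
old_form = [(0.9, -0.2, 0.2), (1.1, 0.8, 0.2)]
print(assemble(pool, old_form, thetas=[-1.0, 0.0, 1.0], R=2))
```

For pools of realistic size the number of subsets is far too large for enumeration, which is why the paper formulates the problem as a 0-1 LP model.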

As already noted, the series in (8) generally approximates the distribution well with
only a few terms, and the precision of the results goes up if the upper bound $R$ in (14)-(15)
is increased. This result is only analytical, however; the actual problem is one of
combinatorial optimization. In practice, item pools are finite and not all possible
combinations of values for the item parameters are available. As a consequence, imposing too
many of the conditions in (7) may occasionally lead to item combinations compromising
between the conditions with results slightly worse than those for a case of fewer conditions.
Though weights could be added to the right-hand side variables in the individual constraints in
(14)-(15), it is hard to base the choice of their values on a theoretic argument. In the empirical
example below it proved best to include constraints for the first two or three terms in (8) in the
model and to apply no weighting.

The above model can be solved for optimal values of $x_i$ and $y$ using standard LP
software or the test assembly software package ConTEST (Timminga, van der Linden &
Schweizer, 1996). For models with a special structure, heuristics as in Luecht and Hirsch
(1992) are convenient. The choice of algorithm to solve the model is further addressed in the
presentation of the empirical example below.

Multidimensional Ability 

A potential danger to IRT-based equating is lack of fit of the response model to the 
data. Such lack of fit is most likely due to the fact that success on the item pool can be 
dependent on more than one ability. An obvious remedy is to use a multidimensional response 
model. A well-known model is the following extension of the 2-PL model, i.e., the model in 
(2) with $c_i=0$ for $i=1,\dots,n$:

$$P_i(\theta) = P(U_i=1\mid\theta_1,\dots,\theta_D,a_{i1},\dots,a_{iD},b_i) = \frac{\exp\left(\sum_{d=1}^{D}a_{id}\theta_d - b_i\right)}{1+\exp\left(\sum_{d=1}^{D}a_{id}\theta_d - b_i\right)}, \tag{21}$$

where $\theta_d$, $d=1,\dots,D$, are the ability variables, $a_{id}$ is the parameter for the discriminating power
of item $i$ along $\theta_d$, and $b_i$ is a parameter for the composite difficulty of item $i$. Detailed
information about the model is given in McKinley and Reckase (1983), Reckase (1985, 1997),
and Samejima (1974). To equate tests measuring possibly multidimensional abilities, Glas
(1992) uses a multidimensional Rasch or 1-PL model. The model is equivalent to the one in
(21) with $a_{id}=1$ for all $i$ and $d$, but assumes that the items display a "simple structure" with
respect to their dependencies on the ability variables; that is, the success on disjoint subsets of 
items in the pool is modeled as being dependent on different unidimensional ability variables. 
In addition, the individual abilities are linked by the assumption of a multivariate normal 
distribution for the population of examinees. 
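
A minimal sketch of the multidimensional response function in (21) is given below. The function name and the example values are hypothetical. Note that in this parameterization the difficulty parameter enters as an intercept of the linear composite, so for one dimension the model corresponds to a 2-PL model with difficulty equal to the intercept divided by the discrimination.

```python
import math

def p_multi(theta, a, b):
    """Response probability in the multidimensional model (21).

    theta and a are sequences of length D; b is the composite difficulty.
    exp(z) / (1 + exp(z)) is written in the equivalent form 1 / (1 + exp(-z)).
    """
    z = sum(a_d * theta_d for a_d, theta_d in zip(a, theta)) - b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-dimensional item evaluated at a few ability vectors.
a, b = [1.2, 0.5], 0.3
for theta in [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]:
    print(theta, p_multi(theta, a, b))
```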

Test assembly from a multidimensional item pool requires a slight adaptation of the 
model. The only changes necessary are substituting the multidimensional response functions 
into the conditions in (7) and specifying a multidimensional rather than a unidimensional grid 
of ability values for the constraints. That is, (14)-(15) have to be replaced by

$$\sum_{i=1}^{I} P_i^r(\theta_{dk})x_i - \sum_{j=1}^{n} P_j^r(\theta_{dk}) \leq y, \quad d=1,\dots,D;\ k=1,\dots,K;\ r=1,\dots,R, \tag{22}$$

$$\sum_{i=1}^{I} P_i^r(\theta_{dk})x_i - \sum_{j=1}^{n} P_j^r(\theta_{dk}) \geq -y, \quad d=1,\dots,D;\ k=1,\dots,K;\ r=1,\dots,R. \tag{23}$$

In the model by Glas, the grids remain unidimensional but different grids have to be specified 
for the separate ability variables for the different subsets of items in the pool. 

Other Scores than Number Correct 

Test results are often reported as a conversion of the number-correct score. If the 
conversion is a monotonically increasing function, the test assembly model in this paper can 
just be applied to the number-correct scores which are then converted to the desired scale 
afterwards. Examples of conversions to which this principle applies are changes of origin 
and/or unit of number-correct scores and "formula scoring" to correct for possible random 
guessing on multiple-choice items. 



Empirical Example 

The test assembly model was applied to a former pool of 753 items from the Law 
School Admission Test (LSAT) program. The items in the pool were calibrated using the 3- 
PL model in (2). The pool consisted of items falling into three different content categories, 
labeled SA, SB, and LA here. In addition, items varied in (sub)type, gender and minority 
orientation, answer key, and word count. Finally, a portion of the item pool had a set structure 
with items in the same set sharing a common stimulus. The type of stimulus varied in content 
description. All existing specifications for the LSAT were modeled as linear constraints 
following the general format in (17)-(18). To model the inclusion of item sets in the test, a 
second type of decision variables was needed in addition to the variables $x_i$ in (13)-(20). In all,
the model had 729 variables and 433 constraints. An old test assembled by hand to meet 
several specifications of the LSAT was known to the authors. The model in (13)-(20) was 
used to assemble new tests of 75 items with observed-score distribution optimally equated to 
the distribution on the old test. 

The model was solved using the First Acceptable Integer Solution heuristic as 
implemented in the ConTEST program. The heuristic first calculates a bound on the
value of the objective function in (13) by relaxing the other decision variables in the model and
then performs a branch-and-bound search for the optimal solution that is stopped when the
first integer solution with objective function value within a small tolerance from this
bound is found. The search is sped up using the optimal reduced costs in the solution to the
relaxed model to fix some of the decision variables. For further details, see Timminga, van der 
Linden and Schweizer (1996, sect. 6.6.5). In the current application, the search for a 0-1
solution was stopped as soon as the value of the objective function, $y$, was smaller than .01.
The observed-score distributions for the old and new tests were generated according to (3)-(4),
with $\theta$ distributed as N(0,1), using a recursive algorithm introduced in Lord and Wingersky
(1984).
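
The generation of an observed-score distribution according to (3) can be sketched as follows. The conditional distribution at each ability value is built with an item-by-item recursion of the kind commonly attributed to Lord and Wingersky (1984). The integral over the standard normal ability distribution is approximated on an equally spaced grid with normalized weights. The grid, its range, and all names and parameters are assumptions of this sketch and are not taken from the original study.

```python
import math

def p3pl(theta, a, b, c):
    """3-PL response probability as in equation (2)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def conditional_dist(probs):
    """Number-correct distribution at one theta by item-by-item recursion."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for x, px in enumerate(dist):
            new[x] += px * (1.0 - p)
            new[x + 1] += px * p
        dist = new
    return dist

def marginal_dist(items, n_points=81, lo=-4.0, hi=4.0):
    """Approximate f(x) in (3) for theta ~ N(0,1) on an equally spaced grid."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + k * step for k in range(n_points)]
    weights = [math.exp(-0.5 * t * t) for t in grid]  # unnormalized N(0,1) density
    total = sum(weights)
    f = [0.0] * (len(items) + 1)
    for t, w in zip(grid, weights):
        cond = conditional_dist([p3pl(t, a, b, c) for a, b, c in items])
        for x, px in enumerate(cond):
            f[x] += (w / total) * px
    return f

# Hypothetical four-item form.
form = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.25), (1.2, 0.7, 0.15), (1.0, 0.2, 0.2)]
print(marginal_dist(form))
```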





Figure 1. Probability functions of observed-score distributions (upper panel) and condition in
(7) (lower panel) for $r=1$ and $\theta=0$ (solid line: new test; dashed line: old test).

Two sets of results were calculated, both for $r=1,\dots,3$ but one with the response
functions controlled only at $\theta=0$ and the other at $\theta_1=-1.0$ and $\theta_2=1.0$. The probability
functions of the observed-score distributions for the first set are plotted in Figures 1-3.





Figure 2. Probability functions of observed-score distributions (upper panel) and conditions in
(7) (lower panel) for $r=1,2$ and $\theta=0$ (solid line: new test; dashed line: old test).

The figures also show the extent to which the sums of the rth power of the response
functions for the old and new tests are equal.





Figure 3. Probability functions of observed-score distributions (upper panel) and conditions in
(7) (lower panel) for $r=1,2,3$ and $\theta=0$ (solid line: new test; dashed line: old test).

The general impression is that the probability functions of the old and new tests are nearly
identical in all four cases, with perfect results for $r=1,2,3$. In this case, the sums of the three
powers of the response functions are also identical for all practical purposes. The sets of
curves for the second case are given in Figures 4-6.





Figure 4. Probability functions of observed-score distributions (upper panel) and condition in
(7) (lower panel) for $r=1$ and $\theta_1=-1.0$ and $\theta_2=1.0$ (solid line: new test; dashed line: old test).

Figure 5. Probability functions of observed-score distributions (upper panel) and conditions in
(7) (lower panel) for $r=1,2$ and $\theta_1=-1.0$ and $\theta_2=1.0$ (solid line: new test; dashed line: old test).

Figure 6. Probability functions of observed-score distributions (upper panel) and conditions in
(7) (lower panel) for $r=1,2,3$ and $\theta_1=-1.0$ and $\theta_2=1.0$ (solid line: new test; dashed line: old
test).






For $r=1$ the fit is comparable to the one obtained for the previous case. The only change is a
shift of the new distribution slightly to the left. For $r=1,2$ the results are perfect. The slight
decrease in fit for $r=1,2,3$ is due to the size of the item pool. As noted earlier, if the item pool
does not have all possible combinations of parameter values, the goal to find a good
compromise between all conditions may lead to a worse result, in particular if the new
condition, such as the one for $r=3$, only has a slight impact on the shape of the observed-score
distribution.



Discussion 

The success of the test assembly model proposed in this paper is predicated on the 
quality of the item pool. If the pool is small relative to the size of the test or not well designed, 
the observed-score distribution of the test assembled from the pool may not fit the distribution of
the target test as well as in the empirical example above. However, in such cases the use of
the test assembly model is still recommended; its results guarantee that the additional
transformation necessary to equate the two test forms exactly involves a minimal distortion of
scale over all possible test forms from the pool. Whether or not additional equating is
necessary can immediately be inferred from such output as in Figures 1-6. The observed-score
distributions of the two test forms in these plots are all that is needed to perform additional 
equipercentile equating. This equating can take place before the test is administered. 

The quality of the item pool is also determined by the quality of item calibration and 
the fit of the (unidimensional or multidimensional) IRT model. The robustness of the results 
in this paper against item calibration errors or item misfit has not been examined yet. 